42 research outputs found

    Predicting genome-wide redundancy using machine learning

    Background: Gene duplication can lead to genetic redundancy, which masks the function of mutated genes in genetic analyses. Methods to increase sensitivity in identifying genetic redundancy can improve the efficiency of reverse genetics and lend insights into the evolutionary outcomes of gene duplication. Machine learning techniques are well suited to classifying gene family members into redundant and non-redundant gene pairs in model species where sufficient genetic and genomic data is available, such as Arabidopsis thaliana, the test case used here. Results: Machine learning techniques that combine multiple attributes led to a dramatic improvement in predicting genetic redundancy over single trait classifiers alone, such as BLAST E-values or expression correlation. In withholding analysis, one of the methods used here, Support Vector Machines, was two-fold more precise than single attribute classifiers, reaching a level where the majority of redundant calls were correctly labeled. Using this higher confidence in identifying redundancy, machine learning predicts that about half of all genes in Arabidopsis showed the signature of predicted redundancy with at least one but typically less than three other family members. Interestingly, a large proportion of predicted redundant gene pairs were relatively old duplications (e.g., Ks > 1), suggesting that redundancy is stable over long evolutionary periods. Conclusions: Machine learning predicts that most genes will have a functionally redundant paralog but will exhibit redundancy with relatively few genes within a family. The predictions and gene pair attributes for Arabidopsis provide a new resource for research in genetics and genome evolution. These techniques can now be applied to other organisms.

    Predictive network modeling of the high-resolution dynamic plant transcriptome in response to nitrate

    BACKGROUND: Nitrate, acting as both a nitrogen source and a signaling molecule, controls many aspects of plant development. However, gene networks involved in plant adaptation to fluctuating nitrate environments have not yet been identified. RESULTS: Here we use time-series transcriptome data to decipher gene relationships and consequently to build core regulatory networks involved in Arabidopsis root adaptation to nitrate provision. The experimental approach has been to monitor genome-wide responses to nitrate at 3, 6, 9, 12, 15 and 20 minutes, using Affymetrix ATH1 gene chips. This high-resolution time course analysis demonstrated that the previously known primary nitrate response is actually preceded by a very fast gene expression modulation, involving genes and functions needed to prepare plants to use or reduce nitrate. A state-space model inferred from this microarray time-series data successfully predicts gene behavior in unlearnt conditions. CONCLUSIONS: The experiments and methods allow us to propose a temporal working model for nitrate-driven gene networks. This network model is tested both in silico and experimentally. For example, the over-expression of a predicted gene hub encoding a transcription factor induced early in the cascade indeed leads to the modification of the kinetic nitrate response of sentinel genes such as NIR, NIA2, and NRT1.1, and several other transcription factors. The potential nitrate/hormone connections implicated by this time-series data are also evaluated.

    Qualitative network models and genome-wide expression data define carbon/nitrogen-responsive molecular machines in Arabidopsis

    BACKGROUND: Carbon (C) and nitrogen (N) metabolites can regulate gene expression in Arabidopsis thaliana. Here, we use multinetwork analysis of microarray data to identify molecular networks regulated by C and N in the Arabidopsis root system. RESULTS: We used the Arabidopsis whole genome Affymetrix gene chip to explore global gene expression responses in plants exposed transiently to a matrix of C and N treatments. We used ANOVA to define quantitative models of regulation for all detected genes. Our results suggest that about half of the Arabidopsis transcriptome is regulated by C, N or CN interactions. We found ample evidence for interactions between C and N that include genes involved in metabolic pathways, protein degradation and auxin signaling. To provide a global, yet detailed, view of how the cell molecular network is adjusted in response to the CN treatments, we constructed a qualitative multinetwork model of the Arabidopsis metabolic and regulatory molecular network, including 6,176 genes, 1,459 metabolites and 230,900 interactions among them. We integrated the quantitative models of CN gene regulation with the wiring diagram in the multinetwork, and identified specific interacting genes in biological modules that respond to C, N or CN treatments. CONCLUSION: Our results indicate that CN regulation occurs at multiple levels, including potential post-transcriptional control by microRNAs. The network analysis of our systematic dataset of CN treatments indicates that CN sensing is a mechanism that coordinates the global regulation of specific sets of molecular machines in the plant cell.

    An integrated genetic, genomic and systems approach defines gene networks regulated by the interaction of light and carbon signaling pathways in Arabidopsis

    Background: Light and carbon are two important interacting signals affecting plant growth and development. The mechanism(s) and/or genes involved in sensing and/or mediating the signaling pathways involving these interactions are unknown. This study integrates genetic, genomic and systems approaches to identify a genetically perturbed gene network that is regulated by the interaction of carbon and light signaling in Arabidopsis. Results: Carbon and light insensitive (cli) mutants were isolated. Microarray data from cli186 is analyzed to identify the genes, biological processes and gene networks affected by the integration of light and carbon pathways. Analysis of this data reveals 966 genes regulated by light and/or carbon signaling in wild-type. In cli186, 216 of these light/carbon regulated genes are misregulated in response to light and/or carbon treatments, where 78% are misregulated in response to light and carbon interactions. Analysis of the gene lists shows that genes in the biological processes "energy" and "metabolism" are over-represented among the 966 genes regulated by carbon and/or light in wild-type, and the 216 misregulated genes in cli186. To understand connections among carbon and/or light regulated genes in wild-type and the misregulated genes in cli186, the microarray data is interpreted in the context of metabolic and regulatory networks. The network created from the 966 light/carbon regulated genes in wild-type reveals that cli186 is affected in the light and/or carbon regulation of a network of 60 connected genes, including six transcription factors. One transcription factor, HAT22, appears to be a regulatory "hub" in the cli186 network as it shows regulatory connections linking a metabolic network of genes involved in "amino acid metabolism", "C-compound/carbohydrate metabolism" and "glycolysis/gluconeogenesis". Conclusion: The global misregulation of gene networks controlled by light and carbon signaling in cli186 indicates that it represents one of the first Arabidopsis mutants isolated that is specifically disrupted in the integration of both carbon and light signals to control the regulation of metabolic, developmental and regulatory genes. The network analysis of misregulated genes suggests that CLI186 acts to integrate light and carbon signaling interactions and is a master regulator connecting the regulation of a host of downstream metabolic and regulatory processes.

    GraphFind: enhancing graph searching by low support data mining techniques

    Background: Biomedical and chemical databases are large and rapidly growing in size. Graphs naturally model such kinds of data. To fully exploit the wealth of information in these graph databases, a key role is played by systems that search for all exact or approximate occurrences of a query graph. To deal efficiently with graph searching, advanced methods for indexing, representation and matching of graphs have been proposed. Results: This paper presents GraphFind. The system implements efficient graph searching algorithms together with advanced filtering techniques that allow approximate search. It allows users to select candidate subgraphs rather than entire graphs. It implements an effective data storage based also on low-support data mining. Conclusions: GraphFind is compared with Frowns, GraphGrep and gIndex. Experiments show that GraphFind outperforms the compared systems on a very large collection of small graphs. The proposed low-support mining technique which applies to any searching system also allows a significant index space reduction.

    Cell-by-cell dissection of phloem development links a maturation gradient to cell specialization

    Publisher Copyright: Copyright © 2021 The Authors, some rights reserved. In the plant meristem, tissue-wide maturation gradients are coordinated with specialized cell networks to establish various developmental phases required for indeterminate growth. Here, we used single-cell transcriptomics to reconstruct the protophloem developmental trajectory from the birth of cell progenitors to terminal differentiation in the Arabidopsis thaliana root. PHLOEM EARLY DNA-BINDING-WITH-ONE-FINGER (PEAR) transcription factors mediate lineage bifurcation by activating guanosine triphosphatase signaling and prime a transcriptional differentiation program. This program is initially repressed by a meristem-wide gradient of PLETHORA transcription factors. Only the dissipation of PLETHORA gradient permits activation of the differentiation program that involves mutual inhibition of early versus late meristem regulators. Thus, for phloem development, broad maturation gradients interface with cell-type-specific transcriptional regulators to stage cellular differentiation.

    Rational Design of Temperature-Sensitive Alleles Using Computational Structure Prediction

    Temperature-sensitive (ts) mutations are mutations that exhibit a mutant phenotype at high or low temperatures and a wild-type phenotype at normal temperature. Temperature-sensitive mutants are valuable tools for geneticists, particularly in the study of essential genes. However, finding ts mutations typically relies on generating and screening many thousands of mutations, which is an expensive and labor-intensive process. Here we describe an in silico method that uses Rosetta and machine learning techniques to predict a highly accurate “top 5” list of ts mutations given the structure of a protein of interest. Rosetta is a protein structure prediction and design code, used here to model and score how proteins accommodate point mutations with side-chain and backbone movements. We show that integrating Rosetta relax-derived features with sequence-based features results in accurate temperature-sensitive mutation predictions.

    A Systems Approach Uncovers Restrictions for Signal Interactions Regulating Genome-wide Responses to Nutritional Cues in Arabidopsis

    As sessile organisms, plants must cope with multiple and combined variations of signals in their environment. However, very few reports have studied the genome-wide effects of systematic signal combinations on gene expression. Here, we evaluate a high level of signal integration, by modeling genome-wide expression patterns under a factorial combination of carbon (C), light (L), and nitrogen (N) as binary factors in two organs (O), roots and leaves. Signal management is different between C, N, and L and in shoots and roots. For example, L is the major factor controlling gene expression in leaves. However, in roots there is no obvious prominent signal, and signal interaction is stronger. The major signal interaction events detected genome wide in Arabidopsis roots are deciphered and summarized in a comprehensive conceptual model. Surprisingly, global analysis of gene expression in response to C, N, L, and O revealed that the number of genes controlled by a signal is proportional to the magnitude of the gene expression changes elicited by the signal. These results uncovered a strong constraining structure in plant cell signaling pathways, which prompted us to propose the existence of a “code” of signal integration.

    An expanded evaluation of protein function prediction methods shows an improvement in accuracy

    Background: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. Results: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regards to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. Conclusions: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent. Keywords: Protein function prediction, Disease gene prioritization.
